PyTorch Fundamentals
Reading time: ~50 minutes | Level: Intermediate-Advanced
The Silent Bug
The training script runs without errors and the loss goes down for the first few epochs. Keep training, though, and something odd happens: instead of converging, the loss overshoots, climbs back up, and oscillates. Nothing crashed, but the model is not learning properly.
import torch
import torch.nn as nn
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
X = torch.randn(32, 10)
y = torch.randn(32, 1)
for epoch in range(5):
    pred = model(X)
    loss = criterion(pred, y)
    loss.backward()
    optimizer.step()  # weights updated
    # optimizer.zero_grad() is missing!
    print(f"Epoch {epoch}: loss={loss.item():.4f}")
# What to expect (exact values depend on the random initialisation):
# the loss does fall over these 5 epochs, but each step applies the running
# sum of all past gradients. Train for longer and the loss typically
# overshoots and oscillates instead of converging.
Every epoch, loss.backward() accumulates gradients on top of the previous epoch's gradients. The optimizer's step() uses the sum of all accumulated gradients -- a signal that grows with each epoch and corrupts the update direction. Without optimizer.zero_grad(), the model is not training with the current batch's gradient; it is training with the sum of every gradient since the start.
This is one of the most common PyTorch bugs. This lesson explains the mechanics behind it and gives you a mental model for avoiding it.
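The accumulation can be observed directly on a single tensor. The snippet below is a minimal sketch (the tensor w and the toy function 3*w are illustrative choices, not part of the example above) showing that a second backward() adds to .grad, and that zeroing resets it:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

# First backward pass: d(3*w)/dw = 3
(3 * w).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is ADDED to the old one
(3 * w).sum().backward()
second = w.grad.clone()

# Zeroing the buffer makes the next backward pass start from scratch
w.grad.zero_()
(3 * w).sum().backward()
third = w.grad.clone()

print(first, second, third)  # tensor([3.]) tensor([6.]) tensor([3.])
```

optimizer.zero_grad() performs the same reset for every parameter the optimizer manages.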
Why This Matters
PyTorch is the dominant framework for research and increasingly for production deep learning. Unlike high-level frameworks that hide the training loop, PyTorch exposes it -- which means you have full control and full responsibility. Understanding the internals is not optional for ML engineering:
- Tensor internals determine memory usage, contiguity bugs, and GPU transfer costs
- Autograd is the foundation of every gradient-based algorithm; debugging loss spikes requires understanding the computation graph
- nn.Module is the building block of every architecture; hooks, buffers, and parameter registration enable techniques from distributed training to quantisation
- DataLoader/Dataset performance directly affects GPU utilisation
- Device management errors are among the most common runtime crashes
1. Tensor Internals: Storage, Strides, and Contiguity
A PyTorch tensor is not just an array of numbers. It is a view over a flat block of memory, described by four components: storage, dtype, shape, and strides.
import torch
# Create a tensor
x = torch.arange(12, dtype=torch.float32).reshape(3, 4)
print(x)
# tensor([[ 0., 1., 2., 3.],
# [ 4., 5., 6., 7.],
# [ 8., 9., 10., 11.]])
# Storage: the flat underlying data
print(x.storage().tolist()) # [0.0, 1.0, 2.0, ..., 11.0] (newer releases prefer x.untyped_storage())
# Strides: how many storage elements to skip to advance by 1 in each dimension
# For a 3x4 row-major tensor: stride=(4,1)
# To advance one row: skip 4 elements. To advance one column: skip 1 element.
print(x.stride()) # (4, 1)
print(x.storage_offset()) # 0 -- starts at the beginning of storage
# Transpose does NOT copy data -- it just swaps the strides
x_t = x.T
print(x_t.shape) # (4, 3)
print(x_t.stride()) # (1, 4) -- now column-major
# is_contiguous checks whether the strides match C-order layout
print(x.is_contiguous()) # True
print(x_t.is_contiguous()) # False -- the transpose is a non-contiguous view
# Operations that require contiguous memory will raise an error on x_t.
# Fix with .contiguous() which creates a new, C-order copy:
x_t_cont = x_t.contiguous()
print(x_t_cont.stride()) # (3, 1) -- now C-order for a 4x3 tensor
# Why this matters for ML:
# 1. view() requires contiguous tensors -- use reshape() which handles it automatically
# 2. Non-contiguous tensors transferred to GPU incur an extra copy
# 3. Some CUDA kernels require contiguous inputs (e.g. certain cuDNN operations)
try:
    x_t.view(-1)
except RuntimeError as e:
    print(e)  # "view size is not compatible with input tensor's size and stride"
x_t.reshape(-1)  # works -- internally calls contiguous() if needed
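The stride arithmetic can be reproduced without PyTorch. The following sketch uses a hypothetical helper, flat_index, to show how an index is mapped to a storage position, and why a transpose only needs swapped strides:

```python
def flat_index(index, strides, offset=0):
    """Map a multi-dimensional index to a position in flat storage."""
    return offset + sum(i * s for i, s in zip(index, strides))

# The 12 values backing the 3x4 tensor from the example above
storage = [float(v) for v in range(12)]

# x has shape (3, 4) and strides (4, 1): x[1, 2] lives at 1*4 + 2*1 = 6
x_12 = storage[flat_index((1, 2), strides=(4, 1))]

# x.T has shape (4, 3) and strides (1, 4): x.T[2, 1] lives at 2*1 + 1*4 = 6
xt_21 = storage[flat_index((2, 1), strides=(1, 4))]

print(x_12, xt_21)  # 6.0 6.0 -- same storage slot, reached through different strides
```

Walking x.T in row order visits storage positions 0, 4, 8, 1, 5, 9, ..., which is why it is not contiguous and why view(-1) cannot describe it without a copy.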
Memory sharing: slices and transposes share the same underlying storage. Writing to a slice modifies the original tensor.
a = torch.ones(4)
b = a[:2] # b is a VIEW of a's storage
b[0] = 99.0
print(a) # tensor([99., 1., 1., 1.]) -- a was modified!
# Use .clone() to get an independent copy
c = a[:2].clone()
c[0] = 0.0
print(a) # tensor([99., 1., 1., 1.]) -- not affected by the write to c
2. Autograd: The Computation Graph
PyTorch builds a dynamic computation graph as operations execute. Every tensor that has requires_grad=True records its creation operation and its inputs. When you call .backward(), PyTorch traverses this graph in reverse, applying the chain rule to accumulate gradients.
import torch
# Leaf tensors -- created by the user, not by an operation
x = torch.tensor([2.0, 3.0], requires_grad=True)
# Operations on x create non-leaf tensors with grad_fn
y = x * 2 # y.grad_fn = MulBackward0
z = y + 1 # z.grad_fn = AddBackward0
loss = z.mean() # loss.grad_fn = MeanBackward0
print(x.is_leaf) # True -- created by user, has no grad_fn
print(y.is_leaf) # False -- created by an operation
print(loss.grad_fn) # <MeanBackward0 object>
# Backprop: computes d(loss)/d(x)
# loss = mean(x*2 + 1) = (2x1 + 1 + 2x2 + 1) / 2
# d(loss)/d(xi) = 2/2 = 1.0 for each element
loss.backward()
print(x.grad) # tensor([1., 1.])
# --- retain_graph ---
# By default, the computation graph is freed after .backward().
# If you need to backprop through the same graph twice (e.g. in MAML,
# or when computing higher-order derivatives), use retain_graph=True:
x2 = torch.tensor([1.0], requires_grad=True)
y2 = x2 ** 2
y2.backward(retain_graph=True) # x2.grad = 2.0
x2.grad.zero_() # clear gradient before second backward
y2.backward() # works because retain_graph=True was used
print(x2.grad) # 2.0
# --- detach ---
# Creates a tensor that shares storage but is not in the computation graph.
# Use for: stopping gradient flow, creating targets, computing metrics.
detached = y2.detach()  # no context manager needed; detached does not require grad
print(detached.requires_grad)  # False
# Alternatively, use torch.no_grad() context manager for a block
# (model, X_test and y_test are assumed to be defined elsewhere)
with torch.no_grad():
    # No graph is built here -- faster and memory efficient for inference
    pred = model(X_test)
    accuracy = (pred.argmax(dim=1) == y_test).float().mean()
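To make the "record, then traverse in reverse" idea concrete, here is a toy scalar autograd in plain Python. It is only a sketch of the principle: the Value class is hypothetical, supports just + and *, and is not how PyTorch is implemented internally. Like PyTorch, it accumulates into the leaf's grad rather than overwriting it.

```python
class Value:
    """Toy scalar that remembers its inputs and the local derivatives."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0                 # only meaningful for leaves
        self.parents = parents          # inputs that produced this value
        self.local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Build a topological order so each node is processed after its consumers
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        upstream = {id(self): 1.0}      # d(output)/d(output) = 1
        for node in reversed(order):
            g_out = upstream.get(id(node), 0.0)
            if not node.parents:
                node.grad += g_out      # leaf: accumulate, do not overwrite
                continue
            for parent, local in zip(node.parents, node.local_grads):
                # Chain rule: contribution = local derivative * upstream gradient
                upstream[id(parent)] = upstream.get(id(parent), 0.0) + local * g_out


x = Value(2.0)
loss = x * 2 + 1          # d(loss)/dx = 2
loss.backward()
print(x.grad)             # 2.0
loss.backward()           # a second pass adds to the existing gradient
print(x.grad)             # 4.0
```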
Gradient accumulation: because .backward() adds to .grad rather than replacing it, you must call optimizer.zero_grad() before each backward pass. This is also the basis for intentional gradient accumulation (simulating large batch sizes):
ACCUMULATION_STEPS = 4  # simulate batch size 4x larger
optimizer.zero_grad()
for i, (X_batch, y_batch) in enumerate(loader):
    pred = model(X_batch)
    loss = criterion(pred, y_batch) / ACCUMULATION_STEPS  # scale loss
    loss.backward()  # accumulate gradients
    if (i + 1) % ACCUMULATION_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()  # flush accumulated gradients
3. nn.Module Anatomy
nn.Module is the base class for all neural network components. Understanding its internals lets you write custom layers, debug architecture bugs, and use advanced features like hooks.
import torch
import torch.nn as nn
class TwoLayerMLP(nn.Module):
    def __init__(self, in_features: int, hidden: int, out_features: int) -> None:
        super().__init__()
        # Attributes that are nn.Module subclasses are automatically registered
        # as submodules and their parameters are included in self.parameters()
        self.fc1 = nn.Linear(in_features, hidden)
        self.relu = nn.ReLU()
        self.drop = nn.Dropout(p=0.3)
        self.fc2 = nn.Linear(hidden, out_features)
        # nn.Parameter wraps a tensor and registers it as a learnable parameter
        # This is how you add non-standard learnable weights (e.g. in custom attention)
        self.scale = nn.Parameter(torch.ones(1))
        # register_buffer registers a tensor that is NOT a parameter (not learnable)
        # but IS part of the model state -- saved/loaded with state_dict.
        # Use for: running statistics (BatchNorm), positional embeddings, masks.
        self.register_buffer("bias_correction", torch.zeros(out_features))
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc1(x)
        x = self.relu(x)
        x = self.drop(x)  # only active in training mode
        x = self.fc2(x)
        x = x * self.scale  # learnable scaling factor
        x = x + self.bias_correction  # buffer used in forward pass
        return x
model = TwoLayerMLP(10, 64, 4)
# Inspect parameters
for name, param in model.named_parameters():
    print(f"{name:25s} shape={tuple(param.shape)} requires_grad={param.requires_grad}")
# The module's own parameters are listed first, then those of its submodules:
# scale shape=(1,) requires_grad=True
# fc1.weight shape=(64, 10) requires_grad=True
# fc1.bias shape=(64,) requires_grad=True
# fc2.weight shape=(4, 64) requires_grad=True
# fc2.bias shape=(4,) requires_grad=True
# Inspect buffers
for name, buf in model.named_buffers():
    print(f"{name} shape={tuple(buf.shape)} requires_grad={buf.requires_grad}")
# bias_correction shape=(4,) requires_grad=False
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total: {total_params}, Trainable: {trainable}")
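The difference between a plain tensor attribute, an nn.Parameter, and a buffer is easiest to see side by side. The sketch below uses a hypothetical Demo module (not part of the lesson's model) to show which of the three PyTorch actually tracks:

```python
import torch
import torch.nn as nn


class Demo(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.plain = torch.ones(2)                        # ordinary attribute: not tracked
        self.weight = nn.Parameter(torch.ones(2))         # learnable, seen by optimizers
        self.register_buffer("running", torch.zeros(2))   # saved state, not learnable


m = Demo()
param_names = [name for name, _ in m.named_parameters()]
buffer_names = [name for name, _ in m.named_buffers()]
state_keys = sorted(m.state_dict().keys())

print(param_names)   # ['weight']
print(buffer_names)  # ['running']
print(state_keys)    # ['running', 'weight'] -- 'plain' is missing
```

Because plain is invisible to state_dict() and to model.to(device), a tensor stored this way is neither saved in checkpoints nor moved to the GPU with the model, which is a frequent source of device-mismatch errors.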
Forward Hooks for Debugging
Hooks let you inspect or modify tensor values during the forward or backward pass without modifying the module.
import torch
import torch.nn as nn
model = TwoLayerMLP(10, 64, 4)
# A forward hook receives (module, input_tuple, output_tensor)
activations = {}
def capture_activation(module, input, output):
    # Detach to avoid holding the computation graph in memory
    activations[module] = output.detach().cpu()
# Register hook on a specific layer
hook = model.fc1.register_forward_hook(capture_activation)
# Run a forward pass
x = torch.randn(8, 10)
_ = model(x)
print(activations[model.fc1].shape) # (8, 64)
print(activations[model.fc1].mean()) # mean activation post-fc1
# Always remove hooks when done -- they persist and affect every forward pass
hook.remove()
# Gradient hooks for debugging vanishing/exploding gradients
def log_grad_norm(module, grad_input, grad_output):
    for g in grad_output:
        if g is not None:
            print(f"{module.__class__.__name__} grad norm: {g.norm().item():.4f}")
hook_bwd = model.fc1.register_full_backward_hook(log_grad_norm)
# ... run forward + backward ...
hook_bwd.remove()
4. The Training Loop
The canonical PyTorch training loop, written defensively:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from typing import Callable
def train_epoch(
    model: nn.Module,
    loader: DataLoader,
    criterion: Callable,
    optimizer: torch.optim.Optimizer,
    device: torch.device,
    grad_clip: float | None = None,
) -> float:
    """Runs one training epoch. Returns mean loss."""
    model.train()  # enables Dropout, BatchNorm running stats update
    total_loss = 0.0
    for X_batch, y_batch in loader:
        X_batch = X_batch.to(device, non_blocking=True)  # async transfer
        y_batch = y_batch.to(device, non_blocking=True)
        # 1. Zero gradients BEFORE the forward pass
        optimizer.zero_grad()
        # 2. Forward pass
        logits = model(X_batch)
        loss = criterion(logits, y_batch)
        # 3. Backward pass
        loss.backward()
        # 4. Optional gradient clipping (prevents gradient explosion in RNNs/Transformers)
        if grad_clip is not None:
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=grad_clip)
        # 5. Parameter update
        optimizer.step()
        total_loss += loss.item()  # .item() extracts scalar, detaches from graph
    return total_loss / len(loader)
@torch.no_grad()  # disables gradient tracking for the entire function
def evaluate(
    model: nn.Module,
    loader: DataLoader,
    criterion: Callable,
    device: torch.device,
) -> tuple[float, float]:
    """Returns (mean_loss, accuracy)."""
    model.eval()  # disables Dropout, uses running stats for BatchNorm
    total_loss = 0.0
    correct = 0
    total = 0
    for X_batch, y_batch in loader:
        X_batch = X_batch.to(device, non_blocking=True)
        y_batch = y_batch.to(device, non_blocking=True)
        logits = model(X_batch)
        loss = criterion(logits, y_batch)
        total_loss += loss.item()
        preds = logits.argmax(dim=1)
        correct += (preds == y_batch).sum().item()
        total += y_batch.size(0)
    return total_loss / len(loader), correct / total
def train(
    model: nn.Module,
    train_loader: DataLoader,
    val_loader: DataLoader,
    n_epochs: int = 50,
    lr: float = 1e-3,
    device: torch.device | None = None,
    patience: int = 5,
) -> dict:
    """Full training loop with early stopping."""
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss()
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=n_epochs)
    history = {"train_loss": [], "val_loss": [], "val_acc": []}
    best_val_loss = float("inf")
    no_improve = 0
    best_state = None
    for epoch in range(1, n_epochs + 1):
        train_loss = train_epoch(model, train_loader, criterion, optimizer, device)
        val_loss, val_acc = evaluate(model, val_loader, criterion, device)
        scheduler.step()
        history["train_loss"].append(train_loss)
        history["val_loss"].append(val_loss)
        history["val_acc"].append(val_acc)
        print(f"Epoch {epoch:3d}/{n_epochs} "
              f"train_loss={train_loss:.4f} val_loss={val_loss:.4f} val_acc={val_acc:.4f}")
        # Early stopping
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            no_improve = 0
            # Save the best weights (deep copy via state_dict)
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            no_improve += 1
            if no_improve >= patience:
                print(f"Early stopping at epoch {epoch}")
                break
    # Restore best weights before returning
    if best_state is not None:
        model.load_state_dict(best_state)
    return history
The loop has five steps: (1) optimizer.zero_grad(), (2) forward pass, (3) loss.backward(), (4) gradient clipping (optional), (5) optimizer.step(). Steps 2, 3 and 5 must happen in that order. The only thing that varies between implementations is whether zero_grad() is called at the top of the iteration or right after step(). Calling it at the top is preferred because every iteration then visibly starts from clean gradients.
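Step 4 is worth unpacking, because clipping by norm rescales all gradients together rather than clamping each value. The pure-Python sketch below illustrates the arithmetic under the assumption that the clipping coefficient is max_norm / (total_norm + eps), capped at 1; the helper name clip_by_global_norm and the list-of-lists gradient layout are illustrative only.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients so that their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    coef = min(1.0, max_norm / (total_norm + eps))  # never scale gradients UP
    clipped = [[g * coef for g in vec] for vec in grads]
    return clipped, total_norm


# Two "parameters" with gradients whose combined norm is sqrt(3^2 + 4^2) = 5
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

print(norm_before)  # 5.0
print(norm_after)   # just under 1.0; the direction of the update is unchanged
```

Because every gradient is multiplied by the same coefficient, the update direction is preserved and only its length shrinks.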
5. Dataset and DataLoader
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np
class TabularDataset(Dataset):
    """
    Wraps numpy arrays as a PyTorch Dataset.
    Dataset.__len__ and Dataset.__getitem__ are the only required methods.
    PyTorch's DataLoader uses these to construct batches.
    """
    def __init__(self, X: np.ndarray, y: np.ndarray) -> None:
        # Convert once at construction, not per-item -- much faster
        self.X = torch.from_numpy(X).float()
        self.y = torch.from_numpy(y).long()
    def __len__(self) -> int:
        return len(self.X)
    def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        return self.X[idx], self.y[idx]
# DataLoader handles batching, shuffling, and multi-process loading
train_dataset = TabularDataset(X_train_np, y_train_np)
val_dataset = TabularDataset(X_val_np, y_val_np)
train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,  # shuffle training data each epoch
    num_workers=4,  # parallel data loading in background processes
    pin_memory=True,  # page-lock host memory for faster GPU transfer
    drop_last=True,  # drop the last incomplete batch (stabilises BatchNorm)
    persistent_workers=True,  # keep workers alive between epochs (reduces spawn cost)
)
val_loader = DataLoader(
    val_dataset,
    batch_size=128,  # larger batch is fine for evaluation (no gradients)
    shuffle=False,  # never shuffle validation data
    num_workers=2,
    pin_memory=True,
)
# Inspect a batch
X_batch, y_batch = next(iter(train_loader))
print(X_batch.shape, y_batch.shape)
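num_workers tuning: start with num_workers=4 and increase until GPU utilisation stops improving. On macOS and Windows, worker processes are started with spawn rather than fork, so the dataset must be picklable and the script's entry point should be guarded by if __name__ == "__main__":; if workers still cause problems, fall back to num_workers=0. On Colab/Kaggle, where few CPU cores are available, 2 is usually a good setting.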
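The effect of batch_size, shuffle and drop_last can be sketched without PyTorch. The generator below is a simplified stand-in for the index batching a DataLoader performs (the name iterate_batches and its seed argument are hypothetical; the real loader also collates items into tensors and can use worker processes):

```python
import random


def iterate_batches(n_items, batch_size, shuffle=False, drop_last=False, seed=0):
    """Yield lists of dataset indices, one list per batch."""
    indices = list(range(n_items))
    if shuffle:
        random.Random(seed).shuffle(indices)  # new order; a real loader reshuffles each epoch
    for start in range(0, n_items, batch_size):
        batch = indices[start:start + batch_size]
        if drop_last and len(batch) < batch_size:
            break  # skip the final, incomplete batch
        yield batch


kept = list(iterate_batches(10, 4, shuffle=True, drop_last=False))
dropped = list(iterate_batches(10, 4, shuffle=True, drop_last=True))

print([len(b) for b in kept])     # [4, 4, 2]
print([len(b) for b in dropped])  # [4, 4]
```

With drop_last=True the two leftover samples are skipped in that epoch; since the order is reshuffled every epoch, different samples are left out each time.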
6. GPU Device Management
import torch
import torch.nn as nn
# --- Device selection ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# On Apple Silicon:
# device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(f"Using device: {device}")
if device.type == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
# --- Moving tensors and models ---
model = TwoLayerMLP(10, 128, 4)
model = model.to(device) # moves all parameters and buffers
# Move tensors
x = torch.randn(32, 10).to(device) # using .to(device) -- preferred
x_gpu = torch.randn(32, 10).cuda() # older API -- hardcodes CUDA
# A common bug: mixing CPU and GPU tensors
try:
    y_cpu = torch.randn(32, 10)
    result = x + y_cpu  # on a CUDA device: RuntimeError, "Expected all tensors to be on the same device ..."
except RuntimeError as e:
    print(e)
# Always move tensors to the same device as the model before the forward pass
# --- non_blocking transfers ---
# Without non_blocking: the CPU waits for the GPU to complete the transfer
# With non_blocking=True on pinned memory: transfer happens asynchronously;
# the CPU can prepare the next batch while the GPU receives the current one
X_batch = X_batch.to(device, non_blocking=True) # overlaps CPU/GPU work
# --- Checking and managing GPU memory ---
if device.type == "cuda":
    print(torch.cuda.memory_allocated() / 1e6, "MB allocated")
    print(torch.cuda.memory_reserved() / 1e6, "MB reserved (cached)")
    torch.cuda.empty_cache()  # releases unused cached blocks (does not free memory held by live tensors)
# --- Context manager for mixed precision ---
# Newer PyTorch releases deprecate torch.cuda.amp in favour of
# torch.amp.autocast("cuda") and torch.amp.GradScaler("cuda")
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()  # scales loss to prevent fp16 underflow
for X_batch, y_batch in train_loader:
    X_batch = X_batch.to(device)
    y_batch = y_batch.to(device)
    optimizer.zero_grad()
    with autocast():  # runs forward pass in fp16 where safe, fp32 elsewhere
        logits = model(X_batch)
        loss = criterion(logits, y_batch)
    scaler.scale(loss).backward()  # scale the loss so small gradients do not underflow
    scaler.step(optimizer)  # unscales gradients, then calls optimizer.step()
    scaler.update()  # adjust scale factor for next iteration
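Mixed precision training (AMP) runs most of the forward and backward pass in float16 while keeping the weights and the optimizer update in float32. On modern GPUs with Tensor Cores it typically reduces activation memory substantially and speeds training up, often by around 1.5-3x, though the gain depends on the model and hardware.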
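Why the loss has to be scaled can be shown with float16 rounding alone. The sketch below uses Python's struct module to round values to half precision; the helper to_fp16, the gradient value 1e-8 and the scale factor 65536 are illustrative assumptions, not values taken from a real training run.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest float16 value (returned as a float)."""
    return struct.unpack("e", struct.pack("e", value))[0]


tiny_grad = 1e-8    # smaller than anything float16 can represent
scale = 65536.0     # illustrative loss-scale factor (2**16)

unscaled = to_fp16(tiny_grad)          # underflows: the gradient is lost
scaled = to_fp16(tiny_grad * scale)    # now inside the float16 range
recovered = scaled / scale             # unscale in higher precision

print(unscaled)    # 0.0
print(recovered)   # close to 1e-8
```

Multiplying the loss by the scale factor multiplies every gradient by the same factor (chain rule), so small gradients survive the float16 backward pass and are divided back down before the float32 weight update.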
7. Model Saving and Loading
import torch
import torch.nn as nn
from pathlib import Path
# --- state_dict: the recommended approach ---
def save_checkpoint(
    model: nn.Module,
    optimizer: torch.optim.Optimizer,
    epoch: int,
    loss: float,
    path: str | Path,
) -> None:
    """
    Saves model weights + optimizer state + training metadata.
    Why save optimizer state? Optimizer has momentum/adaptive terms (Adam).
    Without it, resuming training resets these -- the first few resumed epochs
    behave differently from uninterrupted training.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": loss,
    }, path)
    print(f"Checkpoint saved: {path}")
def load_checkpoint(
    path: str | Path,
    model: nn.Module,
    optimizer: torch.optim.Optimizer | None = None,
    device: torch.device | None = None,
) -> dict:
    """
    Loads a checkpoint. Returns the metadata dict.
    map_location ensures tensors are loaded onto the target device,
    not whatever device they were saved from.
    """
    if device is None:
        device = torch.device("cpu")
    checkpoint = torch.load(path, map_location=device)
    model.load_state_dict(checkpoint["model_state_dict"])
    if optimizer is not None:
        optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint
# --- Inference-only export ---
# For deployment, save only the state_dict (no optimizer state)
torch.save(model.state_dict(), "model_weights.pt")
# Load for inference
model = TwoLayerMLP(10, 128, 4)
model.load_state_dict(torch.load("model_weights.pt", map_location="cpu"))
model.eval() # always set eval mode before inference
# --- TorchScript export (for C++ deployment or locked Python environments) ---
# Scripting compiles the model's Python source, including control flow, into a
# static TorchScript graph (torch.jit.trace, by contrast, records one example run)
scripted = torch.jit.script(model)
scripted.save("model_scripted.pt")
loaded_script = torch.jit.load("model_scripted.pt")
with torch.no_grad():
    out = loaded_script(torch.randn(1, 10))
# --- ONNX export (for deployment across frameworks: TensorRT, ONNXRuntime) ---
dummy_input = torch.randn(1, 10)
torch.onnx.export(
    model, dummy_input, "model.onnx",
    opset_version=17,
    input_names=["features"],
    output_names=["logits"],
    dynamic_axes={"features": {0: "batch_size"}, "logits": {0: "batch_size"}},
)
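A quick way to gain confidence in a save/load path is a round-trip check: save the state_dict, load it into a freshly constructed model, and confirm both models give identical outputs. The sketch below does this in memory with io.BytesIO instead of a file; the small nn.Linear model is a stand-in chosen for illustration.

```python
import io

import torch
import torch.nn as nn

model = nn.Linear(4, 2)

# Save the weights into an in-memory buffer (a file path works the same way)
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# A new instance starts with different random weights ...
restored = nn.Linear(4, 2)
# ... until the saved state_dict is loaded into it
restored.load_state_dict(torch.load(buffer, map_location="cpu"))

x = torch.randn(3, 4)
with torch.no_grad():
    outputs_match = torch.equal(model(x), restored(x))

print(outputs_match)  # True
```

The same check is useful after a TorchScript or ONNX export, using a tolerance-based comparison such as torch.allclose rather than exact equality.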
8. Common Bugs Catalogue
Bug 1: Forgetting zero_grad (opening scenario)
# BAD
for X, y in loader:
    pred = model(X)
    loss = criterion(pred, y)
    loss.backward()
    optimizer.step()  # gradients accumulate every step!
# GOOD
for X, y in loader:
    optimizer.zero_grad()
    pred = model(X)
    loss = criterion(pred, y)
    loss.backward()
    optimizer.step()
Bug 2: In-place operations breaking autograd
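# In-place operations modify a tensor's data without creating a new tensor.
# If an in-place op overwrites a value that the backward pass needs,
# autograd detects the change and raises an error.
x = torch.randn(4, requires_grad=True)
# BAD (case 1): in-place op directly on a leaf that requires grad
# x += 1  # RuntimeError: leaf tensor that requires grad used in an in-place operation
# BAD (case 2): exp() saves its output for the backward pass; overwriting it breaks the graph
y = x.exp()
y += 1  # in-place (y.__iadd__); no error yet
# y.sum().backward()  # RuntimeError: a tensor needed for gradient computation was modified in-place
# GOOD: create a new tensor
y = x.exp()
y = y + 1  # out-of-place; the saved exp() output is untouched
y.sum().backward()  # works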
Bug 3: Calling model.eval() but forgetting torch.no_grad()
# model.eval() disables Dropout and uses BatchNorm running stats -- good.
# But it does NOT stop the computation graph from being built.
# Without torch.no_grad(), autograd still tracks every operation during inference.
model.eval()
# BAD: no torch.no_grad(), so a graph is built and memory is wasted
outputs = model(X_test)  # builds a graph nobody will use
# GOOD
model.eval()
with torch.no_grad():
    outputs = model(X_test)  # no graph built -- faster, uses less memory
Bug 4: Calling .item() inside the loss accumulation loop
# .item() is cheap but it synchronises the CPU and GPU.
# Calling it on every batch in a tight training loop causes GPU pipeline stalls.
# ACCEPTABLE for debugging
for batch in loader:
    ...
    print(loss.item())  # forces GPU sync on every batch
# BETTER: accumulate tensor losses, sync only once per epoch
epoch_loss = torch.tensor(0.0, device=device)
for batch in loader:
    ...
    epoch_loss += loss.detach()  # detach to avoid holding graph
avg_loss = (epoch_loss / len(loader)).item()  # single sync at end of epoch
Bug 5: model.train() / model.eval() in the wrong place
# BatchNorm and Dropout behave differently in train vs eval mode.
# Forgetting to switch causes:
# - Dropout active during evaluation: non-deterministic, lower accuracy
# - BatchNorm uses batch stats instead of running stats during inference:
# predictions change with batch size
# ALWAYS:
model.train() # at the start of each training epoch
model.eval() # at the start of evaluation and inference
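The mode switch is easy to verify on a single Dropout layer. This is a minimal sketch (the layer, the input of ones and the seed are illustrative choices): in training mode, elements are either zeroed or scaled by 1/(1-p); in eval mode, the layer is the identity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
y_train = drop(x)   # each element is either 0 or 1/(1-p) = 2

drop.eval()
y_eval = drop(x)    # identity: the input passes through unchanged

train_values = sorted(set(y_train.tolist()))
print(train_values)             # [0.0, 2.0]
print(torch.equal(y_eval, x))   # True
```

Calling model.train() or model.eval() on a parent module sets this flag recursively on every submodule, which is why a single call at the top of each phase is enough.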
9. A Complete Minimal Example
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
import numpy as np
# --- Data ---
rng = np.random.default_rng(42)
n = 1000
X_np = rng.normal(size=(n, 20)).astype(np.float32)
w = rng.normal(size=(20,)).astype(np.float32)
y_np = (X_np @ w > 0).astype(np.int64)
X_train_t = torch.from_numpy(X_np[:800])
y_train_t = torch.from_numpy(y_np[:800])
X_val_t = torch.from_numpy(X_np[800:])
y_val_t = torch.from_numpy(y_np[800:])
train_loader = DataLoader(TensorDataset(X_train_t, y_train_t), batch_size=64, shuffle=True)
val_loader = DataLoader(TensorDataset(X_val_t, y_val_t), batch_size=200)
# --- Model ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 2),
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()
# --- Training loop ---
for epoch in range(20):
    # Training
    model.train()
    for X_b, y_b in train_loader:
        X_b, y_b = X_b.to(device), y_b.to(device)
        optimizer.zero_grad()
        loss = criterion(model(X_b), y_b)
        loss.backward()
        optimizer.step()
    # Evaluation
    model.eval()
    with torch.no_grad():
        X_val_d = X_val_t.to(device)
        y_val_d = y_val_t.to(device)
        logits = model(X_val_d)
        val_acc = (logits.argmax(1) == y_val_d).float().mean().item()
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1:2d} val_acc={val_acc:.4f}")
Key Takeaways
- Tensors are views over flat storage described by strides. Transpose does not copy data; it changes strides. Non-contiguous tensors can fail view() -- use reshape() or .contiguous().view().
- Autograd builds a dynamic computation graph during the forward pass and traverses it in reverse during .backward(). Gradients accumulate: you must call optimizer.zero_grad() before every backward pass, unless you are accumulating on purpose.
- model.train() and model.eval() switch Dropout and BatchNorm behaviour. Forgetting them causes silent, hard-to-diagnose accuracy differences between training and evaluation.
- Always pair model.eval() with torch.no_grad() at inference time. eval() changes behaviour; no_grad() stops the graph from being built, saving memory and time.
- Use nn.Parameter for learnable tensors and register_buffer for non-learnable state that should be saved with the model (running stats, masks, positional embeddings).
- Save checkpoints with state_dict, not the full model object. Include optimizer state to resume training correctly. Use map_location when loading across devices.
- Mixed precision (AMP) with autocast + GradScaler reduces VRAM usage and accelerates training on modern GPUs with near-zero code changes.
- pin_memory=True in the DataLoader combined with non_blocking=True in the transfer lets CPU-GPU data transfer overlap with GPU computation.
Practice Problems
Problem 1 -- Custom Layer
Implement a GatedLinearUnit (GLU) layer: given input of shape $(N, 2d)$, split it into two halves $a$ and $b$ of shape $(N, d)$, and return $a \odot \sigma(b)$, where $\odot$ is element-wise multiplication and $\sigma$ is sigmoid. Register it as an nn.Module with learnable weight and bias. Verify that gradients flow through it correctly by checking that param.grad is non-None after a backward pass.
Problem 2 -- Gradient Norm Monitoring
Write a training loop that logs the L2 norm of all parameter gradients after each backward pass (before the optimizer step). Plot the gradient norms over 50 epochs for three models initialised differently: Xavier uniform, Kaiming normal, and all-ones. Observe how bad initialisation leads to vanishing or exploding gradients in the first few epochs.
Problem 3 -- Learning Rate Finder
Implement a learning rate range test (Smith 2015): increase the learning rate exponentially from lr_min=1e-6 to lr_max=10 over 100 mini-batches, record the loss at each step, and plot loss vs learning rate on a log scale. The optimal LR is approximately one decade below where the loss begins to diverge. This is the foundation of the torch-lr-finder library.
Problem 4 -- Custom Autograd Function
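Implement the GELU activation function as a custom torch.autograd.Function with explicit forward and backward methods. Verify your backward with torch.autograd.gradcheck (use float64 inputs) and compare the gradient values to PyTorch's built-in nn.GELU. GELU is defined as $\mathrm{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the standard normal CDF; the whole expression is commonly approximated as $0.5\,x\left(1 + \tanh\left[\sqrt{2/\pi}\,\left(x + 0.044715\,x^3\right)\right]\right)$.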
Problem 5 -- Multi-GPU DataParallel
Wrap a model in nn.DataParallel (or nn.parallel.DistributedDataParallel if you have access to a multi-GPU machine). Benchmark training throughput (samples/second) on 1 GPU vs 2 GPUs. Document the scaling efficiency and the overhead sources (batch splitting, model replication, gradient reduction).
